
NBLAST score matrix builder v2 #69

Merged: 21 commits merged into navis-org:master on Apr 12, 2022

Conversation

@clbarnes (Collaborator)

Rising from the grave...

Differences from #28 :

  • Actually handles half-open intervals properly rather than just ignoring the brackets
  • The lookup tables take Digitizer objects rather than raw boundaries. This better encapsulates the logic for parsing and stringifying the digitizers, and makes the handling of left/right half-open boundaries and value clamping more flexible. It's a bit more verbose, but most people will be working with existing CSVs anyway, so it'll all be hidden.
    • Digitizers can be created from linear sequences (useful for dot products), geometric sequences (useful for distances), and from quantiles of existing data
    • Digitizers force the interval to cover -inf to inf. By default they replace the first and last bounds with -inf and inf, but still store the original values so that they can be used like ScoringFunction.max_dist; alternatively, they can prepend -inf and append inf instead (see the sketch after this list)
  • I haven't actually incorporated it into the {N,Syn}Blaster yet
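
A minimal sketch of that clamping behaviour, in plain numpy rather than the PR's actual Digitizer API (make_boundaries and its signature are invented here for illustration):

```python
import numpy as np

# Boundaries are forced to cover (-inf, inf): either the extreme bounds
# are replaced (the default described above), or -inf/inf are pre/appended.
def make_boundaries(bounds, replace_extremes=True):
    bounds = np.array(bounds, dtype=float)
    original = (bounds[0], bounds[-1])  # kept around, cf. ScoringFunction.max_dist
    if replace_extremes:
        bounds[0], bounds[-1] = -np.inf, np.inf
    else:
        bounds = np.concatenate([[-np.inf], bounds, [np.inf]])
    return bounds, original

# Linear sequence, useful for dot products in [0, 1]:
dot_bounds, _ = make_boundaries(np.linspace(0, 1, 11))
# Geometric sequence, useful for distances:
dist_bounds, (_, max_dist) = make_boundaries(np.geomspace(0.1, 100, 10))
print(max_dist)  # 100.0, the original upper bound is still available

# Half-open intervals: np.digitize bins as [low, high) by default.
print(np.digitize([0.05, 0.5, 0.999], dot_bounds) - 1)  # [0 5 9]
```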

And a big training dataset (L/R pairs in the L1 brain) is within reach thanks to @swilson27.

Open question: is ScoringFunction sufficiently internal that I can just replace it with the new Lookup2d class, or should it be kept around for compatibility?

@schlegelp (Collaborator)

I think that ScoringFunction could reasonably be replaced.

@clbarnes (Collaborator, Author)

Order of operations is another possible question. Currently, while building the lookup table, each query-target pair is assessed and its array of output scores is digitized into the array of counts. This keeps memory overhead low, because we don't keep the raw scores around for very long; they're compressed into the table as soon as possible.

However, it might be convenient not to define the digitizers until after we have all the raw scores, so that the boundaries could be learned from that data (as alluded to in Digitizer.from_data()). Keeping all the raw scores around for later digitizing would obviously take much more memory, but realistically this probably isn't much of a concern, and it might take some of the art out of defining ranges and boundaries.
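
For illustration, that deferred approach might look like the following (pure numpy; the raw score arrays are random stand-ins, not real NBLAST output):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for pooled raw scores from all query-target pairs:
raw_dists = rng.exponential(scale=5.0, size=100_000)
raw_dots = np.clip(np.abs(rng.normal(0.7, 0.2, size=100_000)), 0, 1)

# Learn boundaries from the data itself, e.g. evenly spaced quantiles,
# so that each interior bin holds roughly equal counts:
dist_bounds = np.quantile(raw_dists, np.linspace(0, 1, 11))
dot_bounds = np.quantile(raw_dots, np.linspace(0, 1, 11))

# Digitize once, afterwards, at the cost of holding all the raw scores:
counts, _, _ = np.histogram2d(raw_dists, raw_dots,
                              bins=[dist_bounds, dot_bounds])
print(counts.shape)  # (10, 10)
```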

@schlegelp (Collaborator) commented Nov 26, 2021

There might also be a bit of mileage if scoring were serialised. Currently that part of the process hardly registers, but it might with more complex scoring functions and/or bigger matrices.

@schlegelp (Collaborator)

I was wondering if we should provide a system similar to the bridging transforms that users/downstream packages can use to register scoring functions. Users could then invoke NBLAST with e.g. nblast(..., score_func='L1').

@clbarnes (Collaborator, Author)

I don't think there's much of a need to do that - the mapping from a score function name to the score function itself would just be a dict. If users already have a handle to the score function, they can use that directly, with no need to bring in our opinions on an additional naming scheme; or they can maintain the dict themselves if they're working with many scoring functions at the same time.
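
That is, something like the following sketch (the scoring callables are trivial stand-ins; in practice they'd be Lookup2d instances):

```python
# Stand-ins for user-built scoring functions:
def l1_smat(dist, dot):
    return dot - dist / 10  # placeholder scoring logic

def fcwb_smat(dist, dot):
    return dot - dist / 5  # placeholder scoring logic

# The user's own dict is the whole "registry"; no navis-level naming scheme:
score_funcs = {"L1": l1_smat, "FCWB": fcwb_smat}

smat = score_funcs["L1"]
print(smat(2.0, 0.9))
# ...which would then be passed straight through, e.g.
# navis.nblast(queries, targets, smat=smat)
```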

@clbarnes (Collaborator, Author)

I've just given this a go with the lateral pairs in seymour's brain. Building the score matrix doesn't take nearly as long as I thought it would, and said matrix looks pretty sane, even with all the bins determined automatically. That is, if the points are close enough they don't really care about orientation, but the further away they are, the more orientation matters. The distance boundaries increase in a somewhat geometric way; interestingly, the dot product boundaries are a sort of reverse geometric (the bins are large for low values and small for high values, i.e. we don't care whether two orientations are quite different or very different, but we do care whether they're similar or very similar).

clbarnes marked this pull request as ready for review on November 30, 2021 at 15:03
@clbarnes (Collaborator, Author)

I think this should be functional and documented, including an ipynb tutorial, but I'm not entirely sure whether I've got the indexing right for that - where could I find the PR build of the docs?

@schlegelp (Collaborator)

Quick question: I'd be keen to produce a new score matrix for synblast. I guess a 1-D (distance-only) matrix should be straightforward, but I'm trying to think how we could produce a scoring function with separate distances for pre- and postsynapses. For example, we might expect that postsynapses need to be closer than presynapses.

@clbarnes (Collaborator, Author) commented Dec 3, 2021

Yes, 1D is simple - just a LookupNd(Builder) with digitizers=[single_digitizer].

The clearest way to do it would probably be for pre and post to be two separate scoring functions. Possibly the SynBlaster object could take smat: Union[LookupNd, Tuple[LookupNd, LookupNd]]. This would be the only way to have different distance scales for the two.

However, if your distance scale was appropriate for both and you just wanted pre and post to drop into different slots, you could hack a Lookup2d for this purpose as well: if your point match function for two connector tables queried pre to pre and post to post, then returned a tuple of (distance, synapse_type), where synapse_type is e.g. 1.0 for pre and -1.0 for post, your lookup table would then have 2 columns and a distance bin on each row.

Alternatively, you could query all synapses against all synapses; your second axis could then be "is the nearest synapse of the same type?", and that column would basically be set to 0 if they're not. Or you could split it into 4 columns, in case there are different values for matching Q pre to T pre, Q pre to T post, Q post to T post, and Q post to T pre. The two-column hack from the previous paragraph is sketched below.
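
A sketch of that two-column hack, with an invented match function (match_synapses, the +1.0/-1.0 encoding as an array, and the random inputs are all illustrative, not part of this PR's API):

```python
import numpy as np
from scipy.spatial import cKDTree

def match_synapses(q_pre, q_post, t_pre, t_post):
    """Query pre against pre and post against post, returning
    (distance, synapse_type) with pre = 1.0 and post = -1.0."""
    dists, types = [], []
    for q, t, code in [(q_pre, t_pre, 1.0), (q_post, t_post, -1.0)]:
        d, _ = cKDTree(t).query(q)  # distance to nearest same-type synapse
        dists.append(d)
        types.append(np.full(len(q), code))
    return np.concatenate(dists), np.concatenate(types)

rng = np.random.default_rng(0)
q_pre, q_post, t_pre, t_post = (rng.uniform(0, 10, (20, 3)) for _ in range(4))
dist, syn_type = match_synapses(q_pre, q_post, t_pre, t_post)
# A lookup over (dist, syn_type) then has 2 columns (pre vs post)
# and a distance bin per row.
```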

@clbarnes (Collaborator, Author) commented Dec 7, 2021

The Digitizer now inherits from a base class (LookupAxis) and has a sibling, SimpleLookup, which can take non-float values. So your match function could return a float array of distances and a categorical array of synapse types, and the categories could be converted into indices with a SimpleLookup. For now, the Lookup2d convenience class still only handles a pair of (float) Digitizers, because its {to,from}_strings() isn't necessarily well defined for the arbitrary types that SimpleLookup can work with.
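
Conceptually, the categorical axis works along these lines (CategoricalAxis below is a stand-in; this sketch doesn't claim to match SimpleLookup's real interface):

```python
import numpy as np

class CategoricalAxis:
    """Stand-in for a SimpleLookup axis: maps arbitrary hashable values
    to bin indices, where a Digitizer only handles floats."""
    def __init__(self, categories):
        self._index = {c: i for i, c in enumerate(categories)}

    def __call__(self, values):
        return np.array([self._index[v] for v in values])

syn_axis = CategoricalAxis(["pre", "post"])
print(syn_axis(["pre", "post", "pre"]))  # [0 1 0]
```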

I think this is ready to go in, but as it's a pretty significant change I'll hold off until you get a chance to take a look!

@schlegelp (Collaborator) commented Dec 7, 2021

Thanks! Will attempt to have a look Wed or Thu!

@schlegelp (Collaborator) commented Dec 8, 2021

Looks good to me. One thing that came to my mind though is this: as is, NBLAST will complain if data is not in microns. That made sense as long as there was only the FCWB score matrix but might become a nuisance/confusing if people use their own. I can see two options to deal with this:

  1. Remove the microns-check for anything but the FCWB score matrix (assuming that people will know appropriate units with custom matrices).
  2. Make data in microns a soft convention by (a) mentioning it in the tutorial and docstrings and (b) have LookupNdBuilder also complain if neurons aren't in microns.

I'm slightly in favour of option 1.

Otherwise happy for you to proceed and merge.

@clbarnes (Collaborator, Author) commented Dec 8, 2021

Digitizers could optionally have a unit defined on them at instantiation; the Nblaster could then check the Lookup's digitizers for such constraints.

This would work fine for current usage, I think. At the end of the day, the match function, lookup table, and lookup axes are all pretty tightly bound, and the blasting algorithm would have to change a lot before this level of genericness became a problem.

@clbarnes (Collaborator, Author) commented Dec 8, 2021

Hm, this turned into a whole thing 😆 The sensible place to do unit checking is when dotprops are appended to the Nblaster, but partitioned nblasts, where the same dotprops may be added to multiple nblasters, could make this very noisy. I see check_microns jumps through some hoops to avoid parsing unit/quantity strings too often, too, which wouldn't be possible here.

Currently I'm going with adding an ignore_units kwarg to NBlaster.append.

To make things a little more complicated, and as I think you've found already, pint has Quantity("1 micron") == Quantity("1 micrometer") but doesn't treat the two as aliases: i.e. Unit("micron") != Unit("micrometer"). You can manually set them as aliases in the registry (ureg.define("@alias micron = micrometer")), which fixes the __eq__ check but not the __hash__ - i.e. they still come up as different in sets (which, say, Unit("micrometer") and Unit("um") do not).
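
That behaviour, as a runnable demonstration (outputs are as reported above for the pint versions current in this thread; later releases may differ):

```python
import pint

ureg = pint.UnitRegistry()

# Quantities convert between the units, so they compare equal:
print(ureg.Quantity("1 micron") == ureg.Quantity("1 micrometer"))  # True
# ...but the Unit objects are distinct names, not aliases:
print(ureg.Unit("micron") == ureg.Unit("micrometer"))  # False

# Registering the alias fixes __eq__ but not __hash__:
ureg.define("@alias micron = micrometer")
units = {ureg.Unit("micron"), ureg.Unit("micrometer"), ureg.Unit("um")}
print(len(units))  # 2: micron still hashes apart; micrometer/um collapse
```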

Non-isometric units are also a bit tricky; it would be nice to just unitise that all away before it gets anywhere close to NBLAST, but again, doing that multiple times would slow things down.

@clbarnes (Collaborator, Author) commented Dec 8, 2021

Maybe there could be a force_units(neurons: Sequence[BaseNeuron], units: pint.Unit) -> NeuronList function, which:

  1. adds the unit to dimensionless neurons (possibly with a warning)
  2. rescales neurons to use that unit isometrically
  3. possibly tries to figure out a consensus unit if the given units are dimensionless, or at least throws an error if there is unit heterogeneity among the neurons

This function could basically be used wherever check_microns is used now. If it were implemented on NeuronList then it could make use of existing infrastructure for parallelisation and so on.
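
A sketch of that proposal (Neuron below is a minimal stand-in for BaseNeuron, and only the units bookkeeping is shown; real code would also rescale the coordinates):

```python
import warnings
import pint

ureg = pint.UnitRegistry()

class Neuron:
    """Minimal stand-in for BaseNeuron: only carries units."""
    def __init__(self, units=None):
        self.units = None if units is None else ureg.Unit(units)

def force_units(neurons, units):
    target = ureg.Unit(units)
    seen = {n.units for n in neurons if n.units is not None}
    if len(seen) > 1:  # unit heterogeneity is an error
        raise ValueError(f"Heterogeneous units among neurons: {seen}")
    for n in neurons:
        if n.units is None:  # dimensionless: label it, with a warning
            warnings.warn(f"Assuming dimensionless neuron is in {target}")
            n.units = target
        elif n.units != target:  # rescale to the requested unit
            factor = ureg.Quantity(1, n.units).to(target).magnitude
            # real code would apply `factor` to coordinates, radii, etc.
            n.units = target
    return neurons

neurons = force_units([Neuron(), Neuron("nanometer")], "micrometer")
```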

@schlegelp (Collaborator) commented Dec 8, 2021

Oh haha, I didn't mean for you to go down that rabbit hole.

Re a force_units function: Not a fan of implicit conversion of units if I'm being honest - a warning for dummies like me who sometimes forget to convert should be sufficient.

Perhaps the NBlaster or LookupND could have a validate(self, neurons) or preflight(self, neurons) method which is called by e.g. nblast() before appending the neurons?

Re check_microns: yes, I converged on the current implementation after seeing that using the pint objects (a) causes issues and (b) slows things down a lot for large neuron lists. If you end up going with pint.Units, I highly recommend testing with a couple thousand neurons at a time to make sure this won't become a huge bottleneck. FWIW, check_microns could also be generalised to something like check_units(NeuronList, units=('um', 'microns', 'micrometer')).
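
That generalisation might look like this sketch (check_units is the name suggested above; the .units attribute access is schematic, not navis's real API):

```python
def check_units(neurons, units=("um", "microns", "micrometer")):
    """Compare plain unit strings rather than constructing pint objects,
    which gets slow for lists of thousands of neurons."""
    allowed = set(units)
    return all(str(getattr(n, "units", "")) in allowed for n in neurons)

class FakeNeuron:  # minimal stand-in to exercise the sketch
    units = "um"

print(check_units([FakeNeuron(), FakeNeuron()]))  # True
```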

Re microns vs micrometer: in theory, microns should never show up as units. See this line in the setter method:

v = v.replace('microns', 'um').replace('micron', 'um')

Sorry for the messy collection of thoughts. At the end of the day, I'd be perfectly happy with only checking units when smat='auto' (which we should perhaps rename to something like smat='FlyCircuit', actually).

@clbarnes (Collaborator, Author) commented Apr 4, 2022

I put this aside as I ran into a couple of pint bugs, which have since been resolved.

The fixes are due to be released any day now (hgrecco/pint#1450). However, that release also drops support for Python 3.7 👀

@schlegelp (Collaborator)

FWIW: I'm game for dropping 3.7

@clbarnes (Collaborator, Author)

I've stripped out the unit-checking for now as I think this is viable without it; it could be added later with the new pint release (which has just come out).

@schlegelp (Collaborator)

👍 ready to merge when you are!

clbarnes merged commit 7437726 into navis-org:master on Apr 12, 2022